Irish Road Safety

I wanted to investigate trends in Irish road safety using public collision data from 2005 to 2020.

Questions:

Is there a month with a significant difference in collisions?

Is there a trend in collisions on Irish roads from 2005 to 2020? What predictions can we make?

Is there a county in Ireland with a noticeably higher or lower number of fatal collisions per capita?

Data Source

Data were downloaded from Ireland's open data portal, data.gov.ie.

2005 - 2020 collision aggregate: https://data.gov.ie/dataset/roa17-traffic-collisions-and-casualties/resource/06ab21a7-92e5-4506-addc-f8e2486a8dfc

2013 - 2020 collision by County: https://data.gov.ie/dataset/roa27-traffic-collisions-and-casualities

County population data 2016: https://data.gov.ie/dataset/population-classified-by-area

Conclusion

There appears to be a downward trend in injury collisions on Irish roads between 2005 and 2020. However, the R-squared value of 0.05 means that a linear regression on year explains only about 5% of the variation in the monthly values, so the model has limited predictive power.

Further analysis and the inclusion of additional variables are recommended for a more accurate prediction of collision rates.

From 2013 to 2020, relative to its 2016 population, Longford has the highest total number of fatal collisions on Irish roads, while Dublin has the lowest.

Notes

The 'Statistic Label' column records the outcome of a collision. Its categories are 'Fatal Collisions', 'Injury Collisions', 'All Fatal and Injury Collisions', 'Killed Casualties', 'Injured Casualties' and 'All Killed and Injured Casualties'.

In [1]:
# import libraries 
import pandas as pd # pandas for panel data
import matplotlib.pyplot as plt # pyplot for visualization
import seaborn as sns # seaborn for visualization
from sklearn.linear_model import LinearRegression # linear regression model for prediction of collisions
import plotly.express as px # plotly for interactive plots
import numpy as np # import numpy to make jitter for regression plot
In [2]:
# load the data

# Irish road collision data as monthly aggregates from 2005 - 2020
df = pd.read_csv("../data/ROA17.20230921155733.csv")

# Irish road collision data per County from 2013 to 2020
df_county_collision = pd.read_csv("../data/ROA27.20230924131436.csv")

# Irish County population data from 2016
df_county_pop = pd.read_csv("../data/county_population_2016.csv")
In [3]:
# Display the first few rows
df.head()
Out[3]:
   STATISTIC   Statistic Label  TLIST(A1)  Year  C01885V02316  Month of Year    UNIT  VALUE
0    ROA17C1  Fatal Collisions       2005  2005             -     All months  Number  360.0
1    ROA17C1  Fatal Collisions       2005  2005            01        January  Number   31.0
2    ROA17C1  Fatal Collisions       2005  2005            02       February  Number   34.0
3    ROA17C1  Fatal Collisions       2005  2005            03          March  Number   23.0
4    ROA17C1  Fatal Collisions       2005  2005            04          April  Number   20.0
In [4]:
# Summary statistics - how are the values distributed here

# List of columns to include in the description (excluding 'Year')
columns_to_include = [col for col in df.columns if col != 'Year']

# Use describe() on the selected columns
df[columns_to_include].describe()
Out[4]:
TLIST(A1) VALUE
count 1248.00000 1247.000000
mean 2012.50000 719.630313
std 4.61162 1504.400230
min 2005.00000 4.000000
25% 2008.75000 24.000000
50% 2012.50000 478.000000
75% 2016.25000 642.000000
max 2020.00000 10037.000000
In [5]:
# Data information, data types, only missing data is single NULL in VALUE
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1248 entries, 0 to 1247
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATISTIC        1248 non-null   object 
 1   Statistic Label  1248 non-null   object 
 2   TLIST(A1)        1248 non-null   int64  
 3   Year             1248 non-null   int64  
 4   C01885V02316     1248 non-null   object 
 5   Month of Year    1248 non-null   object 
 6   UNIT             1248 non-null   object 
 7   VALUE            1247 non-null   float64
dtypes: float64(1), int64(2), object(5)
memory usage: 78.1+ KB
In [6]:
# What kind of statistics are being labelled
df['Statistic Label'].unique()
Out[6]:
array(['Fatal Collisions', 'Injury Collisions',
       'All Fatal and Injury Collisions', 'Killed Casualties',
       'Injured Casualties', 'All Killed and Injured Casualties'],
      dtype=object)
In [7]:
# Create a copy of the original DataFrame and drop the 'All months' aggregate rows to make plotting and merging easier
filtered_data = df.copy()

# Exclude rows where 'Month of Year' is 'All months' for plotting
filtered_data = filtered_data[filtered_data['Month of Year'] != 'All months']
In [8]:
# generate a plot looking at collision data at different months of the year

# Create a FacetGrid with 'Statistic Label' as the categorical variable
g = sns.FacetGrid(filtered_data, col='Statistic Label', col_wrap=3, height=4, sharey= False)

# Ensure correct order of months for the x-axis
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July','August','September', 'October','November', 'December']

# use a palette for month colours
month_colors = sns.color_palette("husl", n_colors=len(month_order))

# Create bar plots in each grid
g.map(sns.barplot, 'Month of Year', 'VALUE' ,order = month_order, palette = month_colors)
g.set_axis_labels('Month', 'Collision Value')
g.set_titles(col_template='{col_name}')

# Adjust the layout
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Monthly Statistics by Statistic Label', fontsize=16)

g.set_xticklabels(rotation = 90)
plt.tight_layout()
plt.show()
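The bar plots give a visual impression of monthly differences, but the first question asks whether any difference is significant. One possible way to check, shown here as a minimal sketch and not as part of the original analysis, is a permutation test on the one-way ANOVA F statistic. The helper names `f_statistic` and `permutation_p_value` are hypothetical, and the data below are synthetic stand-ins, not the ROA17 figures. On the real data one would pass the `VALUE` and month columns of `filtered_data` for a single 'Statistic Label'.

```python
import numpy as np


def f_statistic(values, groups):
    """One-way ANOVA F statistic for `values` split by `groups` labels."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        member = values[groups == label]
        ss_between += member.size * (member.mean() - grand_mean) ** 2
        ss_within += ((member - member.mean()) ** 2).sum()
    df_between = labels.size - 1
    df_within = values.size - labels.size
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(values, groups, n_permutations=2000, seed=0):
    """Share of label shuffles whose F statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = f_statistic(values, groups)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(groups)
        if f_statistic(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero
    return (exceed + 1) / (n_permutations + 1)


# Synthetic stand-in data (an assumption for illustration): 16 years x 12 months
rng = np.random.default_rng(42)
months = np.tile(np.arange(1, 13), 16)
counts = rng.normal(loc=450, scale=30, size=months.size)
counts[months == 2] -= 60  # make one month clearly lower than the rest

p_value = permutation_p_value(counts, months)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value only says that at least one month differs from the others. Shuffling month labels also ignores any year-to-year trend, so removing that trend first may be worth considering.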
In [9]:
# generate a plot looking at different collision statistics from 2005 - 2020

# Create a FacetGrid with 'Statistic Label' as the categorical variable
g = sns.FacetGrid(filtered_data, col='Statistic Label', col_wrap=3, height=4, sharey= False)

# Create line plots in each grid
g.map(sns.lineplot, 'Year', 'VALUE')
g.set_axis_labels('Year', 'Collision Value')
g.set_titles(col_template='{col_name}')

# Adjust the layout
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Yearly Statistics by Statistic Label', fontsize=16)

# select only every 5 years for labelling
selected_years = [2005, 2010, 2015, 2020]

# Show the plot
plt.xticks(selected_years,rotation=90)  # Rotate x-axis labels for better visibility
plt.tight_layout()
plt.show()
In [10]:
# Use a linear regression model to predict the monthly number of 'Injury Collisions' in 2025.

# Get X and Y variables

injury_collision_df = filtered_data[filtered_data['Statistic Label'] == 'Injury Collisions']

# Create a linear regression model
regressor = LinearRegression()

# Fit the model to your data
X = injury_collision_df[['Year']]
y = injury_collision_df['VALUE']
regressor.fit(X, y)

# Check the R-squared value
r_squared = regressor.score(X, y)
print(f"R-squared value: {r_squared:.4f}")
R-squared value: 0.0528
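To make the R-squared figure concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot for a straight-line fit. It uses `np.polyfit` in place of `LinearRegression`, and the helper name `r_squared` is hypothetical. The data are synthetic (a weak downward trend with large month-to-month scatter), chosen only to illustrate how scatter around the line pulls R-squared down. They are not the ROA17 values.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of an ordinary least-squares straight line fitted to (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = ((y - predicted) ** 2).sum()  # variation left unexplained by the line
    ss_tot = ((y - y.mean()) ** 2).sum()   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Synthetic monthly values (an assumption for illustration)
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2005, 2021), 12)
monthly = 480 - 3.0 * (years - 2005) + rng.normal(0, 60, size=years.size)
r2_monthly = r_squared(years, monthly)

# Averaging within each year removes the within-year scatter
unique_years = np.unique(years)
annual_mean = np.array([monthly[years == yr].mean() for yr in unique_years])
r2_annual = r_squared(unique_years, annual_mean)

print(f"R-squared on monthly values: {r2_monthly:.3f}")
print(f"R-squared on annual means:   {r2_annual:.3f}")
```

With the same number of months in every year, the fit on annual means can never score lower than the fit on monthly values, because the within-year scatter is removed from both sums of squares. A higher score on annual means therefore does not make monthly predictions any more precise.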
In [11]:
# Passing a DataFrame with the same 'Year' column used for fitting avoids the
# scikit-learn feature-name warning, so no warnings need to be suppressed.

# Predict the monthly injury collision count for 2025
predicted_2025 = regressor.predict(pd.DataFrame({'Year': [2025]}))
print("Predicted collision data for 2025:", predicted_2025[0])

# Predict the monthly injury collision count for 2030
predicted_2030 = regressor.predict(pd.DataFrame({'Year': [2030]}))
print("Predicted collision data for 2030:", predicted_2030[0])
Predicted collision data for 2025: 424.85477941176487
Predicted collision data for 2030: 407.91544117647027
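Because the fitted model is a straight line, the two predictions printed above are enough to recover its slope. The short calculation below does that arithmetic. The annual figure assumes each fitted row is a monthly count, which is what the 'All months' filter in cell 7 implies. The variable names are illustrative.

```python
# Values copied from the printed predictions above
pred_2025 = 424.85477941176487
pred_2030 = 407.91544117647027

# Slope of the fitted line: change in predicted value per calendar year
slope_per_year = (pred_2030 - pred_2025) / (2030 - 2025)

# Assumption: the model was fitted on monthly rows, so multiply by 12
# to get a rough annual equivalent for 2025
annual_equivalent_2025 = pred_2025 * 12

print(f"Slope: {slope_per_year:.2f} injury collisions per month, per year")
print(f"Approximate annual total for 2025: {annual_equivalent_2025:.0f}")
```

The slope of roughly -3.4 per year is small next to the spread of the monthly values, which is consistent with the low R-squared reported earlier.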
In [12]:
plt.scatter(X, y, color='blue', s=6)
plt.plot(X, regressor.predict(X), color='red', label='Linear Regression Trend')
plt.xlabel('Year')
plt.ylabel('Collision Value')
plt.title('Collision Data Trend (2005-2020)')
plt.legend()
plt.show()

Looking at Collision data per Irish County

In [13]:
df_county_collision.head(), df_county_pop.head()
Out[13]:
(  STATISTIC   Statistic Label  TLIST(A1)  Year C02451V02968        County  \
 0  ROA27C01  Fatal Collisions       2013  2013            -  All Counties   
 1  ROA27C01  Fatal Collisions       2013  2013           01        Carlow   
 2  ROA27C01  Fatal Collisions       2013  2013           02        Dublin   
 3  ROA27C01  Fatal Collisions       2013  2013           03       Kildare   
 4  ROA27C01  Fatal Collisions       2013  2013           04      Kilkenny   
 
      UNIT  VALUE  
 0  Number    179  
 1  Number      1  
 2  Number     18  
 3  Number     13  
 4  Number      4  ,
      County Population(per 1,000)
 0     Cavan                  76.2
 1   Donegal                 159.2
 2   Leitrim                    32
 3  Monaghan                  61.4
 4     Sligo                  65.5)
In [14]:
# Create a copy of the original collision DataFrame and drop the 'All Counties' aggregate rows to make plotting and merging easier
filtered_collision = df_county_collision.copy()

# Exclude rows where 'County' is 'All Counties' for plotting
filtered_collision = filtered_collision[filtered_collision['County'] != 'All Counties']
In [15]:
# ensure 26 counties are there in each df and that they are labelled the same
len(filtered_collision.County.unique()), len(df_county_pop.County.unique())
Out[15]:
(26, 26)
In [16]:
# ensure both DFs Counties match so can merge on 'County'

# Check if all county names in collision_data exist in population_data
all_counties_matched = all(filtered_collision['County'].isin(df_county_pop['County']))

# Check if all county names in population_data exist in collision_data
all_counties_matched_reverse = all(df_county_pop['County'].isin(filtered_collision['County']))

# Determine if all county names match in both directions
if all_counties_matched and all_counties_matched_reverse:
    print("All county names match in both data frames.")
else:
    print("County names do not match in both data frames.")
All county names match in both data frames.
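The check above reports only whether the names match. If they did not, it would help to see which names are missing on each side. A minimal sketch using set differences follows. The helper `county_mismatches` is hypothetical, and the two small frames hold illustrative values only, with one deliberate spelling mismatch.

```python
import pandas as pd


def county_mismatches(left, right, key="County"):
    """Return names found only in `left` and names found only in `right`."""
    left_names = set(left[key].str.strip())
    right_names = set(right[key].str.strip())
    return sorted(left_names - right_names), sorted(right_names - left_names)


# Illustrative frames (values are made up, not taken from the datasets)
collisions = pd.DataFrame(
    {"County": ["Carlow", "Dublin", "Laois"], "VALUE": [1, 18, 3]}
)
population = pd.DataFrame(
    {"County": ["Carlow", "Dublin", "Laoighis"], "Population(per 1,000)": [10.0, 20.0, 30.0]}
)

only_in_collisions, only_in_population = county_mismatches(collisions, population)
print("Only in collision data:", only_in_collisions)
print("Only in population data:", only_in_population)
```

An empty list on both sides corresponds to the "All county names match" result printed above.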
In [17]:
# merge the collision data with the county population and create normalized value for collision 
merged_county = filtered_collision.merge(df_county_pop, on = 'County')

# change population column to type float and then calculate normalized value
merged_county['Population(per 1,000)'] = merged_county['Population(per 1,000)'].str.replace(',', '', regex=True)
merged_county['Population(per 1,000)'] = merged_county['Population(per 1,000)'].astype(float)

merged_county['Normalized Value'] = merged_county['VALUE'] / merged_county['Population(per 1,000)']
merged_county.head()
Out[17]:
   STATISTIC   Statistic Label  TLIST(A1)  Year  C02451V02968  County    UNIT  VALUE  Population(per 1,000)  Normalized Value
0   ROA27C01  Fatal Collisions       2013  2013            01  Carlow  Number      1                   56.9          0.017575
1   ROA27C01  Fatal Collisions       2014  2014            01  Carlow  Number      5                   56.9          0.087873
2   ROA27C01  Fatal Collisions       2015  2015            01  Carlow  Number      4                   56.9          0.070299
3   ROA27C01  Fatal Collisions       2016  2016            01  Carlow  Number      0                   56.9          0.000000
4   ROA27C01  Fatal Collisions       2017  2017            01  Carlow  Number      3                   56.9          0.052724
In [18]:
# Create a plot looking at total number of fatal collisions, normalized per population, for each county
merged_county_fatal = merged_county[merged_county['Statistic Label'] == 'Fatal Collisions']

# group the data by County
group_by_county = merged_county_fatal.groupby('County')
In [19]:
# create a bar plot of fatal collisions per 1,000 population for each county (yearly values are stacked)

fig = px.bar(merged_county_fatal, x="County", y="Normalized Value", 
             color="County", 
             title="Total Normalized Fatal Collision Values by County (2013 - 2020)", 
             labels={"County": "County", "Normalized Value": "Total Normalized Collision Values"})
fig.update_xaxes(tickangle=-45)  # rotate the x-axis tick labels by 45 degrees
fig.show()
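The bar chart stacks one segment per year, so reading off the highest and lowest counties means comparing stacked heights by eye. A sketch of an explicit ranking is given below: sum the fatal collisions per county, divide by the population in thousands, and sort. The frame here is a tiny synthetic stand-in with counties "A", "B" and "C". On the real data the same steps would apply to `merged_county_fatal`. The column names `total_fatal`, `population_k` and `fatal_per_1000` are illustrative.

```python
import pandas as pd

# Synthetic stand-in for merged_county_fatal (values are made up)
fatal = pd.DataFrame({
    "County": ["A", "A", "B", "B", "C", "C"],
    "Year": [2013, 2014] * 3,
    "VALUE": [1, 5, 18, 20, 4, 2],
    "Population(per 1,000)": [50.0, 50.0, 1000.0, 1000.0, 100.0, 100.0],
})

# Total fatal collisions per county; population is constant within a county
totals = fatal.groupby("County", as_index=False).agg(
    total_fatal=("VALUE", "sum"),
    population_k=("Population(per 1,000)", "first"),
)

# Fatal collisions per 1,000 population over the whole period
totals["fatal_per_1000"] = totals["total_fatal"] / totals["population_k"]

# Highest rate first
ranked = totals.sort_values("fatal_per_1000", ascending=False).reset_index(drop=True)
print(ranked)
```

In this toy example county "B" has the most fatal collisions in absolute terms but the lowest rate once population is taken into account, which is why the per capita view matters for the third question.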